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Safeguarding  Hessian  Approximations  in  Trust  Region  Algorithms 

Richard  G.  Carter 


Abstract.  In  establishing  global  convergence  results  for  trust  region  algorithms  applied  to 
unconstrained  optimization,  it  is  customary  to  assume  either  a  uniform  upper  bound  on  the 
sequence  of  Hessian  approximations  or  an  upper  bound  linear  in  the  iteration  count.  The  former 
property  has  not  been  established  for  most  commonly  used  secant  updates,  and  the  latter  has  only 
been  established  for  some  updates  under  the  highly  restrictive  assumption  of  convexity. 

One  purpose  of  the  uniform  upper  bound  assumption  is  to  establish  a  technical  condition  we 
refer  to  as  the  uniform  predicted  decrease  condition.  We  show  that  this  condition  can  also  be 
obtained  by  milder  assumptions,  the  simplest  of  which  is  a  uniform  upper  bound  on  the  sequence 
of  Rayleigh  quotients  of  the  Hessian  approximations  in  the  gradient  directions.  This  in  turn  sug¬ 
gests  both  a  simple  procedure  for  detecting  questionable  Hessian  approximations  and  several 
natural  procedures  for  correcting  them  when  detected. 

In  numerical  testing,  one  of  these  procedures  increased  the  reliability  of  the  popular  BFGS 
method  by  a  factor  of  two  (i.e.,  the  procedure  halved  the  number  of  test  cases  to  fail  to  converge 
to  a  critical  point  in  a  reasonable  number  of  iterations).  Further,  for  those  problems  where  both 
methods  were  successful,  this  safeguarding  procedure  actually  improved  the  average  efficiency  of 
the  BFGS  by  ten  to  twenty  percent. 

Key  words.  Unconstrained  optimization,  trust  region  methods,  secant  methods,  quasi- 
Newton  methods,  global  convergence. 
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1.  Introduction.  Trust  region  algorithms  are  an  important  class  of  iterative  methods  that  can 
be  used  for  solving  the  unconstrained  minimization  problem 

minimize  f  (ar)  /. 

i£E" 

where  /  :  En  — » ]R  is  continuously  differentiable  with  gradient  g :  Rn  — ►  RB  and  Hessian 
V2/  :  JR"  — +•  JRnXn .  At  each  iteration  k,  the  function  /  is  approximated  by  the  quadratic  model 

Mx)  =  f  {xk)  +  g{xk)T{x  -**)  +  ^-(x  -xk)rBk{x  -  xk)  (1.2) 

where  Bk  G  lRnXn  is  a  symmetric  matrix  intended  to  approximate  V2/  (a;*).  A  new  iterate  ar*+1  is 
then  generated  by  solving  (perhaps  approximately) 

minimize  i>k(x) 

*6M”  (1.3) 

s/t  II X  -xk  II  <  A*  . 

The  positive  scalar  Ak ,  known  as  the  trust  radius,  is  adjusted  at  each  iteration  to  ensure  that  ipk  is 
a  reasonably  accurate  approximation  to  /  over  the  trust  region:  {x:  \\x  -  xk  ||  <  A*}.  More  com¬ 
plete  descriptions  can  be  found,  for  example,  in  [8],  [13],  [24],  and  [27]. 

The  literature  contains  global  convergence  results  for  many  different  implementations  of  the 
trust  region  method.  If  the  Hessian  of  /  is  uniformly  bounded  and  /  is  bounded  below,  then  the 


condition 


\\Bk  II  <  ft).  0  <  /?o  <  oo 


is  sufficient  to  establish  lim  ||j*  |j  =0  for  most  implementations,  while  the  condition 

k — ►oo 


P*  ||  < /Vl  +  *),  0</?o<°o  (1.5) 

is  sufficient  to  establish  the  weaker  result  lim  inf  ||  gk  ||  =  0  (see,  for  example  [24]  and  [22]). 

k  — •■oo 

In  the  event  that  an  analytic  expression  for  the  Hessian  is  impractical  and  a  finite  difference 
approximation  is  too  expensive,  it  is  customary  to  use  one  of  the  secant  methods  to  compute  the 
model  Hessian  Bk  ([7],  [8]).  Unfortunately,  the  question  of  whether  these  methods  satisfy  bounds 
of  the  form  (1.4)  remains  open.  Condition  (1.5)  can  be  established  for  the  popular  BFGS  secant 
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method,  but  only  under  the  highly  restrictive  assumption  that  /  is  convex  [20]. 

One  of  the  ways  that  conditions  (1.4)  and  (1.5)  are  used  in  proving  global  convergence  is  to 
establish  that,  for  every  iteration  where  g(xk)^  0,  the  approximate  solutions  to  (1.3)  satisfy 

1>k{xk)-'l>k{*k+  i)  >  yfti  llg(^)  II  min  |At,  11  (1.6) 

or 

4>kM-'l>k{xk+i)  >  yfti  IM**)  II  min  |A*,  (1-7) 

respectively,  where  /?i  >  0.  However,  (1.6)  and  (1.7)  can  also  be  established  by  the  milder  condi¬ 
tions 

9MTBkg(xk)  <  /30g{xk)Tg{xk)  (1.8) 

and 

9MTBkg{xk)  </?0(l  +  k)  g(xk)T g(xk)  (1.9) 

respectively. 

Conditions  (1.8)  and  (1.9)  are  no  less  of  an  open  question  for  secant  methods  than  are  (1.4) 
and  (1.5).  However,  several  natural  methods  are  available  for  correcting  Bk  so  that 
9MTBkg{xk )  g{xk)T V2/ (xk)g(xk).  Under  the  mild  assumption  that  ||  V2/ (xk)  ||  is  uni¬ 
formly  bounded,  these  methods  guarantee  (1.8)  and  hence  (1.6). 

In  this  paper  we  present  several  procedures  for  safeguarding  Hessian  approximations.  These 
procedures  were  designed  with  the  following  goals  in  mind. 

(a)  Condition  (1.6)  must  be  satisfied  at  every  iteration  for  some  constants  /30  and  Px 
independent  of  k . 

(b)  The  test  used  to  determine  when  a  correction  is  to  be  made  should  be  inexpensive  and 
should  be  as  scale  invariant  as  possible. 

(c)  Any  corrections  made  should  not  change  the  invariance  properties  of  the  original 
method.  Furthermore,  if  used  in  conjunction  with  a  secant  method  which  produces 
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positive  definite  Hessian  approximations,  the  corrections  should  also  be  positive 
definite. 

(d)  In  numerical  testing,  the  safeguards  must  improve  the  overall  reliability  of  the  trust 
region  algorithm  without  significantly  harming  the  efficiency. 

Surprisingly,  design  goal  (d)  was  actually  exceeded  by  our  procedures.  Even  though  each  correc¬ 
tion  involved  extra  work,  the  number  of  iterations  typically  needed  for  convergence  decreased 
enough  so  that  the  overall  efficiency  was  improved  by  10  to  20%  for  our  test  set.  Moreover,  one  of 
our  procedures  decreased  the  number  of  test  cases  that  failed  to  converge  in  a  reasonable  number  of 
iterations  by  a  factor  of  two. 

It  should  be  pointed  out  that  (1.8)  is  not  of  itself  sufficient  to  imply  global  convergence. 
Bounds  such  as  (1.4)  and  (1.5)  are  not  only  used  to  establish  (1.6)  or  (1.7),  but  are  also  used  to 
establish  conditions  about  how  the  trust  radius  is  updated  at  each  iteration.  However,  it  is  shown 
in  [2]  that  the  sufficiency  conditions  on  the  trust  radius  updating  procedure  can  be  obtained  if  (1.4) 
is  replaced  by  the  weaker  condition 

—  PoPTP  <PTBkp  V  nonzero  p  £R"  .  (1.10) 

That  is,  global  convergence  can  be  established  provided  (1.8)  holds  and  the  eigenvalues  of  Bk  are 
bounded  away  from  —  oo.  For  typical  secant  methods  that  force  each  Bk  to  be  positive  definite, 
the  correction  procedures  presented  in  later  sections  are  sufficient  to  give  lim  ||  =0. 

k — ►oo 

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  presents  nomenclature  and 
preliminary  theory.  Section  3  presents  the  procedures  for  safeguarding  and  proves  that  they 
enforce  (1.6).  Section  4  presents  our  test  results,  and  describes  some  of  the  alternatives  investi¬ 
gated  during  this  study.  We  conclude  the  paper  with  a  summary  of  results  and  a  brief  discussion 
of  some  possible  refinements  to  our  safeguarding  techniques. 


TR87-12 


June  1987 


Safeguarding  Hessian  Approximations 


4 


2.  Preliminaries.  The  following  notation  is  used  throughout  this  paper.  The  function 
/  :  ]Rn  ->  e1  has  gradient  g :  Rn  -+  Rn  and  Hessian  V2/  :  ]Rn  -*■  IRn x" .  The  vectors  {xk  }  are  the 
iterates  produced  by  the  trust  region  algorithm,  and  pk  =  xk+1  -  xk .  For  brevity,  }k  will  denote 
f  (xk)>  9k  wiil  denote  g(xk),  etc.  The  notation  ||  ■  ||  denotes  the  Euclidean  norm  when  applied  to 
a  vector  and  the  operator  norm  induced  by  the  Euclidean  norm  when  applied  to  a  matrix. 

The  predicted  function  reduction  at  the  kth  iteration  is  defined  by 


predk(p)  =  fk-i>k{xk+p) 

=  -  9k  P  —  y  P  T  BkP  • 


(2.1) 


The  function  U(  •••)  represents  whatever  procedure  we  are  originally  using  to  generate  or  update 
our  Hessian  approximations  {Bk}  (i.e.,  Bk+l  =  U (Bk ,  pk ,  gk+}  —gk )).  The  notation  w  denotes 
w/||w||  for  nonzero  w£  1RB.  For  any  nonzero  it£R*  and  symmetric  HelRnX",  c(B,w) 
represents  the  Rayletgh  quotient  of  B  in  the  direction  w : 


c(B ,  w)  =  wT Bw /wT  w  .  (2.2) 

This  notation  is  appropriate  since  c(Bk,w)  also  represents  the  curvature  of  the  quadratic  model  ipk 
in  the  direction  w . 

Using  this  notation,  (1.6)  becomes 

predk(pk)  >  ~/dl  ||?*  ||  min  {A*,  U?*  ||//?0}  .  (2.3) 


We  refer  to  this  as  the  uniform  predicted  decrease  condition.1  Similarly,  (1.3)  becomes 

minimize  (—  predk(p)) 

s/t  Up  II  <  Afc  , 

or  the  trust  region  subproblem,  and  (1.8)  becomes 

c{Bk,gk)  <  Po  ■ 


(2.4) 


(2.5) 


1  This  terminology  helps  distinguish  between  conditions  (1.6)  and  (1.7).  The  word  “uniform”  refers  to  the  fact  that 
the  denominator  of  the  last  term,  0O  ,  is  not  a  function  of  k . 
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A  variety  of  methods  are  available  for  computing  approximate  solutions  to  the  trust  region 
subproblem.  In  general,  these  methods  satisfy 

Predk(Pk)  >  @2 min {- predk(p)  s/t  p  eSk,  ||p  ||  <  A*}  (2.6a) 

and 

\\Pk  II  </?gA*  ,  (2.6b) 

for  some  subspace  Sk  and  constants  /92G(0, 1),  /?3>1.  Algorithms  which  satisfy  (2.6)  for  the 
choice  Sk  =  R"  are  usually  referred  to  as  Hebden-More{ see  [12],  [13],  and  [16]),  optimal  locally  con¬ 
strained  (OLC)  (see  [10]),  or  hookstep  (see  [8])  methods,  and  typically  require  between  one  and  two 
matrix  factorizations  per  iteration.  Other  methods  select  a  subspace  Sk  of  smaller  dimension  in 
order  to  avoid  these  factorizations.  The  dogleg  method  of  Powell  [18]  and  the  double  dogleg  of 
Dennis  and  Mei  [6]  solve  the  trust  region  subproblem  over  a  piecewise  linear  path  embedded  in 
span  {—  gk,  —  Bk  lgk}.  Steihaug  [28]  considers  a  conjugate  gradient  method  in  which  a  sequence  of 
trial  steps  pk,  *  =  1,  2,  •  are  generated  that  minimize  the  quadratic  model  over  a  sequence  of 
expanding  Krylov  subspaces  Sk  =  span  {—  gk ,  B ,—Bkgk, —  Bk‘  gk).  The  procedure  continues 
until  either  pk  is  a  sufficiently  accurate  global  minimizer  of  the  quadratic,  or  the  piecewise  linear 
path  defined  by  the  trial  steps  intersects  the  boundary  of  the  trust  region.  All  of  these  methods 
satisfy  (2.6)  with  Sk  defined  to  be  span  {—  gk }.  Given  this  minimal  property,  it  is  fairly  straight¬ 
forward  to  show  that  (2.5)  implies  the  uniform  predicted  decrease  condition  (2.3). 


LEMMA  1.  Let  wk  and  gk  be  nonzero  vectors  in  JR"  and  let  0*  be  the  angle  between  wk  and 
—  gk .  Consider  the  problem 


minimize  (  —  predk (p)) 
s/t  p  =  a  wk  ,  |[p  ||  <  A* 

where  c(Bk,  wk )  and  cos©*  are  not  both  zero.  A  vector  p  *  solves  (2.7)  only  if 

,  II <7*  ||  cos©*  a 

P  =  - rn - \ —  wk 

c(Bk,wk) 

with  ||p*  ||  <  A*  and  c(Bk,wk)  >  0,  or 


(2.7) 


(2.8) 
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P*  =  A*w*,  (2.9a) 

or 

p*  =  -Akwk.  (2.9b) 

Furthermore,  if  (2.8)  solves  (2.7)  we  have 

predk(p')=  j  ||  gk  ||2  cos2©*/e  (£* ,  wk) ,  (2.10) 

while  if  (2.9a)  or  (2.9b)  solves  (2.7)  we  have 

prtdk{p')  >  ycos©*  ||ff*  ||  A*  .  (2.11) 


Proof.  Problem  (2.7)  is  simply  the  minimization  of  a  quadratic  in  one  dimension  subject  to 
upper  and  lower  bounds.  Substituting  p  =awk  and  cos(©*)  =  -  gkwk/(  ||j*  ||  ||w*  ||)  into  (2.1), 
we  can  write 


q  (a)  =  —pred(  a  wk  ) 

~  —  a(cos ©*  || gk  ||  ||w*  ||)  +  ~  ||w*  ||2c(£*,  wk) 
and  transform  problem  (2.7)  into 

minimize  5(a) 
s/t  |a|<At/  ||u;t  ||  . 


(2.12) 


(2.13) 


(i)  If  c(Bk,  wk)  >  0,  it  is  easy  to  establish  that  (2.8)  and  (2.10)  hold  provided  the  constraint  is 
inactive.  If  the  constraint  is  active,  then  (2.9)  and  (2.11)  follow  from  (2.8)  and  (2.13)  since 
sign  (a’)  =  sign  (-  gkwk )  and  a *  ||u>*  ||  <  cos©*  ||j*  |  |/c  [Bk ,  wk )  so  that 

g(a‘)=  -  a*  \\wk  ||cos©*  ||ff*  ||  |l  -  || wk  ||  . C ^ZT * ’■ ■"* \  1 

(  2  cos©*  || gk  || 

<  -  a*  || w*  ||  cos©*  || <7*  || {1  —  %}  (2.14) 

<-  y  A*c°s©*  ||j*  ||  . 
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Hence  (2.11)  holds. 

(ii)  If  c (Bk ,  wh)  <0,  then  the  constraint  is  necessarily  binding,  and  (2.7)  is  solved  by  (2.9a) 
and/or  (2.9b).  Then  (2.12)  immediately  implies  (2.11).  □ 


THEOREM  2.  If  there  exist  02  >  0  and  /?3  >  1  such  that  (2.6)  is  satisfied  at  every  iteration 
and  gk  £  Sk ,  then  (2.5)  is  sufficient  to  imply  the  uniform  predicted  decrease  condition  with  0y  =  02- 
Moreover,  if  wk  £  Sk  is  a  nonzero  vector  satisfying 


c(Bk  ,wk)  <  0t 


(2.15) 


for  some  0 4  >  0,  then  (2.6)  implies 


predk{pk)>  -i-  02  cos  0*  ||j/*  ||  min  {a*,cos©*  ^ 


(2.16) 


where  Gk  is  the  angle  between  wk  and  —  gk . 


Proof.  Applying  Lemma  1  to  (2.6)  with  wk  =  —  gk  immediately  establishes  that  (2.5)  implies 
(2.3).  Equation  (2.16)  also  follows  immediately  from  Lemma  1.  □ 

3.  Safeguarding  procedures.  We  now  present  three  safeguarding  procedures.  The  first  is  sim¬ 
ple,  direct,  and  can  be  applied  to  any  method  for  generating  Hessian  approximations.  This  pro¬ 
cedure  first  compares  the  curvature  of  the  model  in  the  gradient  direction  with  an  estimate  of  the 
maximum  curvature  of  the  problem.  If  the  model  curvature  seems  too  large,  the  model  Hessian  is 
scaled  using  a  finite  difference  approximation  of  the  exact  curvature  c  (V2/  (ar*+i),</*+i).  The  posi¬ 
tive  constants  macheps  and  typx  represent  machine  epsilon  and  an  estimate  of  the  typical  size  of 

Ik  II- 

Procedure  1 

Given  k  >  1,  e*_!  >0,  xk,  xk+u  pk,  gk,  gk+l,  and  Bk,  do  the  following. 

(1)  Approximate  V2/  (ari+1)  by  some  method:  Bk+1  =  U(Bk, ...)  . 
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(2)  Estimate  the  maximum  curvature  of  the  problem: 

ck  =  max  {<:*_!,  p?(gk+l  -  gk)/PkPk }  •  (3.1) 

(3)  If  c(Bk+l,  gk+\)  >  ck  then  apply  a  correction  to  Bk+1: 

(a)  Approximate  c(V2/  (z*+i),  ?*+i)  by  finite  differences: 


l 


£  =  (macheps)z{typx)/  ||j*+1  ||  , 

(3.2a) 

-  _  f  k+\~  f  (z*+l  —  e  9k+ 1)  ~  £  9k+l9k+ 1 

c*  -  2  2  T 

(3.2b) 

£  9k+l9k+l 

Scale  Bk+l: 

If  ck  >  0,  then  set  a  =  ck/c  {Bk+U  gk+l ), 

else  set  a=ek/e(Bk+l,  gk+x). 

Set  Bk+l:  =  aBk+l. 


(4)  Exit. 

The  next  procedure  is  more  sophisticated  and  makes  use  of  the  properties  of  secant  updates 
explicitly.  Such  methods  are  designed  to  satisfy  the  equation 

U(B,p,y)p  =  y  .  (3.3) 

Thus,  if  p  =  pk  and  y  =  yk  =  gk+l  —  gk,  the  update  Bk+1  =  U(Bk,  pk,yk)  will  exactly  interpolate  the 
change  in  the  gradient: 

Bk+i(xk+l  -  xk)  =  gk+l  -  gk  .  (3.4) 

Such  methods  have  proven  to  be  very  successful.  Of  particular  importance  are  methods  which 
have  the  property  of  hereditary  positive  definiteness: 

(Bk  positive  definite  )  (S*+i  positive  definite.)  (3-5) 
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Many  secant  methods  have  this  property  provided  ykPk  >  0,  so  it  is  customary  to  forego  the 
update  if  ykVk  <  0-  That  is, 

(ykTPk  <0)  (H*+1  =  5*  )•  (3.6) 

Updates  that  satisfy  (3.5)  and  (3.6)  are  said  to  be  positive  definite  secant  updates. 

Our  last  procedure  corrected  £4+i  by  using  finite  differences  to  try  to  force 
c(Bk+i,  ff*+i)=  c(V2/  (z*+i),  <7*+i).  By  using  (3.3),  we  can  instead  use  finite  differences  to  try  to 
force  the  stronger  condition 

^k+iSk+i  —  ^2f  •  (3-7) 

Procedure  2 

Given  k  >  1,  ck-X  >  0,  xk,  xk+x,  pk,  gk,  9k+u  and  Bk,  do  the  following 

(1)  Approximate  V2/  (z*)  with  a  secant  update: 

2 Ik  =  9k+ 1  -  9k 

Bk+i  '■=  U{Bk,Pk,  Vk)  ■ 

(2)  Estimate  the  maximum  curvature  of  the  problem  using  (3.1). 

(3)  If  c(Bk+i,  9k+i)  >  ck ,  then  apply  correction  to  Bk+X. 

e  =  macheps  [typx)/  ||p*+1  ||  , 

Pe=~e9k  + 1, 

Ve=  9{xk+l  +  Pt)~  9k+ 1  • 

If  ply<.>  0,  then 

Bk+i:=U(Bk+i,Pc,y  c), 

Else 

Bk+1:  —  {ck  / c  (Bk+i>  S*+i))  Bk+ 1 . 

End  If. 

(4)  Exit. 


(3.8a) 

(3.8b) 


(3.9a) 

(3.9b) 

(3.9c) 

(3.10) 

(3.11) 
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A  similar  procedure  has  been  used  by  Dixon  [9]  for  constrained  optimization  with  a  com¬ 
pletely  different  justification.  Let  /  be  the  Lagrangian  function:  /(x,  X)  =  f  (x)  +  \T A(x),  where 
h :  IR"  — ►  ]Rn  is  a  set  of  m  equality  constraints  and  X  £  lRm  is  the  set  of  Lagrange  multipliers  of 
the  constrained  problem.  Dixon’s  method  performs  an  extra  update  on  the  approximation  to  the 
Hessian  of  the  Lagrangian,  denoted  Bk .  This  update  is  of  the  form 

4  =  U(Bk>eVxl(xk,\k),V,l(xk+eVxl{xk,\k))-Vxl(xk,\k)),  (3.12) 

and  is  included  to  ensure  that  the  search  directions  selected  are  descent  directions  with  respect  to  a 
particular  merit  function. 

Both  hereditary  positive  definiteness  and  the  UPD  condition  are  easily  established  for  both  of 
these  procedures  under  very  mild  conditions  on  /  .  The  results  are  as  follows. 

THEOREM  3.  Let  c0  >  0  and  positive  definite  Bi  be  given.  If  the  update  U(B  ,p  ,y)  used  by 
procedure  (l)  or  (2)  satisfies  (3.5)  and  (3.6),  then  each  of  these  procedures  generates  a  positive 
definite  Hessian  approximation  at  every  iteration. 

Proof.  Since  c0  >  0,  we  have  ck  >  0  for  all  k .  Now,  suppose  that  for  all  i  <k  ,  B{  is  posi¬ 
tive  definite.  Then  the  “uncorrected”  Bk+i  will  be  positive  definite  by  hypotheses  (3.5)  and  (3.6). 
In  procedure  (1),  a  is  therefore  positive  and  the  corrected  Bk+1  will  still  be  positive  definite.  For 
the  same  reason,  any  corrected  Bk+1  generated  by  (3.11)  will  be  positive  definite,  and  by 
hypotheses  (3.5)  and  (3.6),  any  Bk+l  generated  by  (3.10)  will  be  positive  definite.  Our  result  fol¬ 
lows  by  induction.  □ 

THEOREM  4.  Let  c0  >  0,  macheps  >  0,  and  typx  >  0  be  given.  If  g  is  Lipschitz  continuous 
so  that  for  some  L  <  oo  ,  || j/(ar)  -  g{y)  ||  <  L  \\x  -  y  ||  for  all2  x,  y  £11"  ,  then  the  curvature 

estimates  {ck\  generated  by  procedures  (1)  and  (2)  are  uniformly  bounded.  Furthermore, 
{c(Bk,  gk)}  is  uniformly  bounded  above  for  both  procedures. 

2  The  assumption  of  Lipschitz  continuity  over  the  whole  of  R”  is  made  for  convenience  rather  than  necessity.  A 
smaller  region  can  be  used,  such  as  any  open  convex  set  that  contains  the  level  set  of  /  at  x0. 
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Proof.  From  the  Cauchy-Schwarz  inequality  and  Lipschitz  continuity  of  g,  we  have 
Pk (i/i+i  9k) / Pk  Pk  i  and  hence  ck  <  max  {c0,L}  .  In  either  procedure,  a  correction  is  done  if 

c(Bk+i,  &.+1)  >  ck.  Now,  if  ck  <0  in  procedure  (1)  or  pjyc<0  in  procedure  (2),  then  the  correc¬ 
tion  is  simply  Bk+1:  =  (ek/c  (Bk+1,  gk+l))  Bk+l,  resulting  in  c(Bk+1,  <fe+1)  <  max  {e0,  L}.  The 
remaining  two  cases  are  as  follows. 


(i)  In  procedure  (1),  ck  is  computed  by  (3.2b).  Now, 

l 

f  k+i  —  / (x*+i  —  £  ff*+i)  =  J  9 {Xk+i  ~  ^  9k+i)T Qk+M^ 


so  that 


(3.13) 


2  T 
e  9k+i  9k+ 1  o 


f(£  9k+ 1  ){g{xk+i  -  *£  9k+ 1)  -  ?*+i)dX 


< 


■/  llfc+ill  \\9{xk+i-\egk+1)- gk+1\\d\ 


<\\9k+l\\2Jo 

2  1 

<  7]|gt+1 1|  /L  lk+i-X€fc+,-^+,  II  dx 


<  L 


(3.14) 


Hence,  if  ck  >  0,  Bk+l  is  replaced  by  ( ck/c(Bk+l ,  j*+1))i?*+1,  resulting  in  an  approximation 
which  satisfies  c  (Bk+1,  gk+\)  <  L . 

(ii)  In  procedure  (2),  a  correction  is  made  by  an  extra  secant  update  provided  yj  pe  >  0,  where 
p c  and  y c  are  given  by  (3.9).  The  corrected  Bk+1  will  satisfy  Bk+i  pc=  yc  ■  Premultiplying 
by  gk+ 1 ,  using  the  Cauchy-Schwarz  inequality,  and  invoking  the  Lipschitz  continuity  of  g 
allows  us  to  write 

e  9k+lBk+l9k+l  =  9k+ 1  (?(**+!  —  £  9k+i)  ~  0*+i) 

<  ll?*+i  \\TL  lk*+i-£y*+1-a:i+i  II  (3-15) 

=  ll?*+i  II2 » 

and  thus  c(Bk+i,  gk+i)  <  L  .  Therefore,  procedures  (1)  and  (2)  guarantee  that  in  every 
situation  c(Bk+i,  jr*+1)  <  max  (c0,  L}  .  □ 
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The  third  procedure  we  present  is  similar  to  procedure  (1),  except  any  scalings  of  the  Hessian 
approximations  are  done  before  the  update.  This  approach  is  motivated  by  the  self  scaling  algo¬ 
rithms  of  Oren  and  Luenberger  [17]  and  the  suing  algorithm  of  Dennis,  Gay  and  Welsch  [4],  Self 
scaling  algorithms  set  a  =  | Pk{9k+\  —  9k)  \/Pk'BkPk,  or  equivalently 


\PkUk+i  ~  9k)  I  1 
PkPk  c{Bk,pk) 


(3.16) 


The  matrix  Bk+l  is  then  generated  by 

Bk+i  =  U(aBk,pk,yk)  .  (3.17) 

Sizing  algorithms  are  similar,  except  that  the  modification  is  only  done  if 
e(Bk,pk)>  |  Pk{gk+i  ~  Qk )  I  I  PkPk-  Our  procedure  is  obtained  by  substituting  ck  for 
\Pk{9k+i  ~  9k)  \/PkPk  and  c(Bk,  gk+l)  for  c(Bk,pk)  in  the  above  formula. 


Procedure  3 

Given  k  >  1,  ck_x  >  0,  xk,  xk+1,  pk,  gk,  gk+l,  and  Bk,  do  the  following. 

1)  Estimate  the  maximum  curvature  of  the  problem  using  (3.1). 

2)  Set  ak  =min{l,  ck/c(Bk,  0*+1)}  . 

3)  Set  Bk+1=U  ( ak Bk,pk,yk)  for  yk  =  gk+i  -  gk  ■ 

4)  Exit. 


We  shall  prove  results  for  this  procedure  only  for  a  specific  secant  update,  the  BFGS  (see,  for 
example,  [7],  [8]): 

if  yTp  <0 


Bbfgs{B ,  p ,  y)  — 


B 


BppT  Bt  yy 


(3.18) 


{  P  Bp 


T 

y  p 


otherwise  . 


This  is  probably  the  most  popular  positive  definite  secant  update  in  use  today. 


TR87-12 


June  1987 


Safeguarding  Hessian  Approximations 


13 


THEOREM  5.  Let  c0  >  0  and  positive  definite  B\  be  given.  Then  procedure  (3)  generates  a 
positive  definite  Hessian  approximation  at  every  iteration  if  U  =  Ubfgs- 

Proof.  Since  c0  >  0,  each  ck  >  0  and  every  ak  >  0.  Thus,  if  Bk  is  positive  definite,  aBk  is 
positive  definite.  Proof  that  Ubfgs  is  a  positive  definite  update  satisfying  (3.5)  can  be  found,  for 
example,  in  [5].  Since  Bx  is  positive  definite,  our  result  follows  by  induction.  □ 

The  next  theorem  uses  the  highly  restrictive  assumption  that  /  is  convex,  and  is  therefore 
not  as  strong  as  our  results  for  procedures  (1)  and  (2).  However,  this  result  is  stronger  than  results 
for  the  standard  BFGS,  in  that  we  achieve  c(Bk,  gk)<./30  (which  is  associated  with  the  strong 
result  lim  ||  =0)  rather  than  c(Bk,  gk)  </90(l  +  k)  (which  is  associated  with  the  weaker  result 

*— «-oo 

lim  inf  ||  =0.)  The  proof  is  almost  the  same  as  that  used  for  the  standard  result  (see  [19], 

[21]). 


THEOREM  6.  Let  c0  >  0  and  a  positive  definite  matrix  B1  be  given.  If  the  Hessian  of  /  is 
uniformly  bounded  and  /  is  convex  on  1R“,  then  procedure  (3)  with  U  =  Ubfgs  generates  a 
sequence  of  Hessian  approximations  for  which  c{Bk,  gk)  is  uniformly  bounded. 


Proof.  From  Theorem  5,  we  know  that  each  Bk  is  positive  definite.  Since  V2/(ar)  is  uni¬ 
formly  bounded,  there  exists  L  such  that  ||  V2/  (z)  ||  <  L  for  all  x  £Rn.  Now, 
l 

9k+i  —  9k  =  /  V2/  {xk  +  \pk  )pkd\,  so  we  have 
0 

1 

Pk(9k+i~9k)  =  /  P*rV2/  (zk  +\pk)Pk  d\  <  L  \\pk  ||2  .  (3.19) 

o 

From  this  we  can  deduce  that  ck  <max{c0,  L},  and  hence  c(akBk,  (fc+i)  <  max  {c0,  L)  . 


If  pkyk  <0,  we  have  Bk+l=akBk  and  c(Bk+1,  gk+1)  <max  {c0,  L}.  Otherwise, 


■B*+i  —  <*kBk 


ak 


BkPkPkTBkT 
Vis Bk  pk 


+ 


VkVk 

yhk 


(3.20) 


l  l 

Now,  define  Hk  =  JV2/  ( xk  +  >^pk)d\  so  that  Hkpk  =  J  V2/  ( xk  +  \pk)pkd\  =  gk+1  -  gk  =  yk  . 
o  o 


TR87-12 


June  1987 


Safeguarding  Hessian  Approximations 


14 


From  the  convexity  of  /  ,  we  see  that  Hk  is  positive  definite  or  positive  semidefinite,  and  thus  there 
exists  Hk‘  satisfying  {Hk2)T  (Hkh)  =  Hk.  Similarly,  there  exists  Bkh  such  that  {Bkk)T  {B^)  =  Bk . 
Substituting  these  expressions  into  (3.20)  yields 


'A\T 


Bk+ 1  =  ®k{Bk  ) 


I- 


(Bk*Pk)(Bk*Pk)T 
(Bk»Pk)T  (Bk»Pk) 


U>\T 


Bk  +  (Hk) 


(j H?pk){H?Vk)T 
(H?Pk)T  (Hk*Pk) 


H? 


(3.21) 


Defining  vk  =  Bk  >Apk  and  wk  =  Hk&pk  leads  to 


Qk+\Bk+\gk+i  =  ak(BkAgk+i)T  {I  —  vk  vk){Bk¥jgk+1)  +  (Hk,igk+l)T  (wk  wf)(Hk‘gk+1) 

<«*  ||/-M*r||a  l|5*^+1||2+  1 1  wk  wk  1 1 2  \\H?gk+l  II2  (3.22) 

=  <Xk\\Bk%+1\\2+\\Hk%+1\\>> 

so  that 


c  {Bk+ 1)  Qk+  i)  <cjfe+L<c0  +  2L  . 


(3.23) 


4.  Numerical  results.  The  three  procedures  of  the  last  section  were  implemented  in  a  trust 
region  algorithm  based  on  the  Dennis/Schnabel  optimization  code  [8],  Approximate  solutions  to 
the  trust  region  subproblem  (2.3)  were  computed  with  an  optimal  locally  constrained  procedure 
([8],  [10],  [12],  [16]).  The  BFGS  method  was  used  to  generate  updates  U(B ,  p ,  y). 

Twenty-six  problems  were  selected  from  the  standard  test  set  of  More',  Garbow,  and 
Hillstrom  [14].  These  problems  were  NPROB  =  1,  2,  ...,  18  with  the  default  problem  dimensions, 
and  the  variable  dimension  problems  6,  7,  8,  9,  13,  14,  and  15  with  dimensions 
n  =  10,  9,  18,  6,  6,  10  and  20  respectively.  An  “easy  quadratic”  problem  with  n  =  4  was 
included  as  an  additional  control. 

Table  (1)  shows  a  representative  sample  of  the  results  obtained  by  our  procedures  when  the 
algorithm  was  executed  from  the  standard  starting  point.  The  numbers  0  to  3  in  the  first  column 
represent,  respectively,  the  unmodified  BFGS,  procedure  (1),  procedure  (2),  and  procedure  (3).  The 
next  two  columns  represent  the  problem  number  and  dimension.  “Itn”  represents  the  number  of 
iterations  performed,  and  “C”  represents  the  number  of  corrections  done.  The  “#f”  and  “#g” 
columns  represent  the  number  of  function  and  gradient  evaluations  required.  The  “/*”  and 
“  ||  g*  ||”  columns  represent  the  function  value  and  the  norm  of  the  gradient  at  the  final  iteration. 
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The  two  numbers  in  the  “performance  measure”  column  represent  two  different  ways  of 
measuring  algorithm  performance.  Criteria  (A)  is  defined  as  the  sum  of  the  number  of  function  and 
gradient  evaluations,  while  criteria  (B)  is  the  number  of  function  evaluations  plus  n  times  the 
number  of  gradient  evaluations.  Thus  the  first  column  is  the  best  measure  if  gradient  evaluations 
are  inexpensive,  while  the  second  column  is  the  best  measure  if  gradient  evaluations  are  much  more 
expensive  than  function  evaluations. 

The  results  for  all  twenty-six  test  problems  are  summarized  in  Table  (2).  Corrections  were 
made  in  10-20%  of  the  iterations3.  Even  though  these  corrections  required  an  extra  function 
evaluation  in  procedure  (1)  and  an  extra  gradient  evaluation  in  procedure  (2),  the  reduction  in 
total  number  of  iterations  is  sufficiently  large  to  actually  reduce  the  number  of  equivalent  function 
evaluations.  Problem  8  with  n  =  18  is  the  most  striking  example  of  this.  The  algorithm  using  the 
unmodified  BFGS  is  still  far  from  the  solution  after  200  iterations.  Procedures  (l)  and  (3)  have 
not  converged  either,  but  are  much  closer  (as  measured  by  ||ff20oll  and  / 200  being  roughly  100 
times  smaller  for  procedures  (1)  and  (3)  than  for  the  BFGS).  Procedure  (2)  is  the  clear  winner, 
having  converged  in  89  iterations. 

Table  (3)  presents  the  data  from  Table  (2)  in  a  normalized  format.  Each  of  the  performance 
indicators  shown  in  the  table  is  normalized  by  the  value  from  the  unmodified  BFGS  method  with 
the  exception  of  the  second  column  under  “normalized  count  of  corrections,”  which  instead  denotes 
the  fraction  of  iterations  at  which  a  correction  was  performed.  The  first  4  rows  represent  the  com¬ 
plete  data  set,  while  rows  5  to  8  represent  only  those  test  cases  for  which  all  of  the  methods  suc¬ 
ceeded.4  We  see  that  options(l)  and  (3)  improved  most  performance  indicators  by  roughly  10%, 
while  procedure  (2)  increased  performance  by  15  to  20%. 

To  investigate  how  reliable  our  procedures  were,  we  ran  the  previous  set  of  test  cases  again 
with  different  starting  points  (10*  and  100*  the  default  values)  and  included  an  additional  12 


8  It  is  worth  mentioning  that  (3.11)  (the  naive  correction  done  in  procedure  (2)  if  the  secant  correction  cannot  be 
guaranteed  to  produce  a  positive  definite  matrix)  was  never  invoked  in  any  of  our  test  runs. 

''It  can  be  argued  that  this  reduced  data  set  gives  a  better  measure  of  the  relative  efficiency  of  algorithms. 
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Table  (1) 

Performance  of  algorithms  on  representative  problems. 


#/ 

#9 

41 

30 

37 

27 

/(*') 

m 

0.14d-18 

0.68d-08 

0.47d-15 

0.13d-06 

0.30d-14 

0.10d-05 

0.36d-14 

0.19d-05 

0.57d-02 

0.27d-05 

0.57d-02 

0.99d-06 

0.57d-02 

0.21d-05 

0.57d-02 

0.24d-05 

0.19d-18 

0.15d-08 

0.62d-19 

0.87d-10 

0.86d-16 

0.34d-07 

0.11d-13 

0.15d-06 

0.84d-05 

0.40d-05 

0.84d-05 

0.38d-05 

0.84d-05 

0.89d-06 

0.84d-05 

0.14d-05 

0.47d-01 

0.59d+00 

0.17d-03 

0.21d-03 

0.14d-03 

0.14d-05 

0.17d-03 

0.75d-03 

0.20d-13 

0.26d-06 

0.12d-10 

0.47d-05 

0.26d-14 

0.11d-06 

0.19d-12 

0.59d-06 

0.27d-03 

0.87d-05 

0.27d-03 

0.17d-05 

0.27d-03 

0.11d-05 

0.27d-03 

0.17d-05 

0.61d-09 

0.14d-05 

0.12d-08 

0.58d-05 

0.31d-08 

0.50d-05 

0.14d-09 

0.50d-06 

0.30d-08 

0.31d-05 

0.16d-10 

0.85d-05 

0.15d-07 

0.11d-04 

0.16d-09 

0.41d-05 

0.39d-14 

0.25d-05 

0.21d-16 

0.12d-06 

0.59d-14 

0.16d-05 

0.16d-15 

0.27d-06 

*  Algorithm  exceeded  maximum  number  of  iterations  allowed. 
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Table  (2) 

Summary  of  performance  of  algorithms  on  26  test  problems. 


Procedure 

Number 

of 

iterations 

Number 

of 

corrections 

Number  of 
evaluations 

_ f _ 2 

Performance 

measures 

_ (A) _ (B) _ 

0 

1047 

0 

1309 

1047 

2356 

9525 

1 

902 

108 

1329 

902 

2231 

8963 

2 

696 

111 

946 

807 

1753 

6637 

3 

939 

186 

1268 

939 

2207 

8923 

Table  (3) 

Test  data  normalized  with  respect  to  control  (unmodified  BFGS). 


Procedure 

Normalized 
count  of 

iterations  corrections 

Normalized 
count  of 
evaluations 

_ f _ 2 _ 

Normalized 

performance 

measures 

(A)  (B) 

0 

1.00* 

0.00 

1.00 

1.00 

1.00 

1.00 

1 

.86* 

.12 

1.00 

.86 

.95 

.94 

2 

.66* 

.16 

.72 

.77 

.74 

.70 

3 

.90* 

.20 

.97 

.90 

.94 

.94 

0 

1.00  f 

0.00 

1.00 

1.00 

1.00 

1.00 

1 

.83* 

.14 

.98 

.83 

.91 

.89 

2 

.72t 

.17 

.75 

.84 

.79 

.85 

3 

.86^ 

.18 

.93 

.86 

.90 

.88 

*  Data  from  all  26  test  problems. 

+  Data  from  reduced  set  of  the  23  test  problems  for  which  all  methods  converged, 
problems  of  larger  dimension  (n  =  10  to  n  =68)  .  The  unmodified  BFGS  failed6  on  21  out  of  the 
144  cases  while  procedures  (l),  (2),  and  (3)  failed  on  16,  12,  and  20  respectively.  Thus,  at  least  for 
this  test  set,  procedure  (l)  increases  the  reliability  of  the  BFGS  method  by  25%,  and  procedure  (2) 
increases  the  reliability  by  45%.  Other  performance  indicators  for  this  expanded  test  set  are 
roughly  comparable  with  those  shown  in  Table  (3),  with  the  exception  that  procedures  (1)  and  (3) 
performed  relatively  poorly  on  the  large  dimensional  problems.6 

Given  the  excellent  numerical  performance  of  procedure  (2),  the  remainder  of  our  investiga¬ 
tions  focus  on  this  technique.  In  order  to  examine  performance  of  this  procedure  when  invoked 
with  a  greater  or  lesser  frequency,  we  devised  alternative  tests  for  triggering  the  extra  update.  The 
first  such  test  defines 


5  Several  criteria  were  used  to  define  “failure”,  such  as  still  being  far  from  the  solution  after  200  iterations. 

®Each  option  failed  on  the  same  problems  that  the  unmodified  BFGS  failed  on.  Furthermore,  procedure  (1)  was  less 
efficient  by  a  factor  of  30%.  However,  this  is  not  conclusive  due  to  the  small  size  of  the  problem  set. 
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ck  =  max  {m2ck-u  vk{gk+l  -  gk)/PkPk}  (4-1) 

and  triggers  an  extra  update  whenever 

e {Bk+\,  9k+i)  >  ™ick,  (4.2) 

where  mjEfO,  oo]  and  m2  E  [0,  l]  are  constants.  The  constant  m2  controls  how  much  “memory” 
the  test  has  of  previous  curvature  estimates.  For  example,  m2  =  0  corresponds  to 
ck  =  max  {0,  vk{gk+\  —  9k)IVkVk}'-  the  current  curvature  estimate.  The  constant  mx  allows  a  more 
direct  control  of  how  often  corrections  are  triggered:  mi  =  oo  corresponds  to  the  unmodified  BFGS 
method,  while  mx  =  0  corresponds  to  performing  an  extra  update  at  every  iteration.  We  tested  this 
new  triggering  formula  on  the  original  26  problems  using  two  different  starting  points  for  each  (1* 
and  10*  the  default  values)  with  the  values  mi  E  {oo,  4, 2, 1, 0.5, 0.25, 0}  and 
m2E  {1,0.9,  0.7,0.5,0.01}. 

Several  trends  were  noticed  among  the  performance  indicators  as  mx  was  varied  from  oo  to  0. 
The  percentage  of  iterations  at  which  the  model  Hessian  Bk  was  corrected  varied  monotonically 
from  0%  to  95%, 7  but  the  number  of  iterations  required  (as  compared  to  the  unmodified  BFGS) 
varied  monotonically  from  100%  to  55%.  The  performance  indicators  for  m1>  2  showed  little 
change  from  the  BFGS  in  efficiency.8  For  l>m,>0.25,  the  efficiency  was  improved  by  roughly 
15%  (  using  criteria  (A)  )  and  by  10%  (  using  criteria  (B)  );  for  mj  =  0,  the  efficiency  was 
unchanged  (using  criteria9  (B)  )  and  improved  by  20%  (  using  criteria  (A)  ).  The  number  of 
failures  dropped  from  8  to  2.  Experiments  with  m2  showed  similar  (although  slightly  less  clear) 
trends. 

These  trends  can  be  summarized  by  saying  the  algorithm  performs  quite  well  for  mi  E  [0, 1]. 
A  value  of  mx  =  l/z  and  m2=l  is  suggested  for  typical  problems,  although  miE[0,  %]  and 
m2=  1/100  may  be  preferable  if  the  gradient  is  very  inexpensive  to  evaluate  or  if  reliability  is  very 
important.  Table  (4)  presents  a  small  sample  of  our  data  from  this  part  of  the  investigation. 

Corrections  were  not  made  at  the  initial  and  final  iterations. 

was  modified  in  fewer  than  5%  of  the  iterations,  which  is  apparently  not  enough  to  improve  efficiency.  Howev¬ 
er,  even  this  small  percentage  was  enough  to  roughly  halve  the  failure  rate  for  this  test  set. 

®This  is  hardly  surprising,  since  we  have  roughly  halved  the  number  of  iterations  while  doubling  the  number  of  gra¬ 
dient  evaluations  per  iteration. 
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Table  (4) 

Test  of  triggering  corrections  using  equations  (4-1)  and  (4-2). 


m  1 

m2 

Normalized 
count  of 

iterations  corrections 

Normalized 
count  of 
evaluations 

_ L _ 2 _ 

Normalized 
performance 
measures 
(A) _ (B) 

Normalized 

count 

of 

failures 

00 

- 

'  ‘  '  '  ; 

1.00 

1.00 

warn 

1.00 

m 1 

0.76 

0.82 

0.50 

Wmm 

■ 

0.84 

0.93 

0.86 

0.38 

mmn 

0.57 

1.07 

0.25 

00 

- 

1.00 + 

0.00 

■SB 

1.00 

- 

1.0 

1.00 

0.83  + 

0.11 

wsm 

0.92 

- 

0.24 

Ha 

0.95 

- 

MM 

1  0.56 T 

0.94 

0.55 

1.08 

- 

*  Data  from  full  set  of  52  test  cases. 

+  Data  from  reduced  set  of  the  43  test  problems  for  which  all  methods  converged. 


An  alternative  to  the  test  c(S*+1,  gk+i)  >  ek  is  suggested  by  equations  (2.15)  and  (2.16)  in 
Theorem  2.  From  the  secant  equation  and  the  Lipschitz  continuity  of  g,  we  have  that  the 
unmodified  update  satisfies 


Pk^k+iPk  <L  ||p*  ||2  (4.3) 

so  that 

c(Bk+l,Pk)<L  .  (4.4) 

Let  0max  be  any  fixed  angle  in  (0,  -^-),  and  let  ©t+1  be  the  angle  between  pk  and  gt+1.  If 

a 

cos0*+i  >cos0max  and  if  pk  &Sk+i,  then  by  (2.16)  we  have  the  uniform  predicted  decrease  condi¬ 
tion  (2.3)  at  the  next  iteration  without  requiring  a  correction  (/30  and  Pi  are  taken  to  be 

7T 

L  /cos  0max  and  P2COS  ©max?  respectively).  For  example,  if  ©H-i>©max=  ~T  and  /?2=sl>  we  have 
that 


predk+i(pk+i)  >  max  {predk+i(p ):  p  =  apk,  ||p  ||  <  A*+1} 

>  -i- cos  ©*+1  ||?*+1  1 1 min  |a*+1,  cos ©*+1 

„  1  ,  V2,  „  •  L  H»+i  II 

>  y(— )  II 9k+i  Hmin  1A*+i’  V2 L  " 

For  OLC  type  methods  of  computing  trial  steps,  Sk  is  taken  to  be  !Rn ;  hence  for  these  methods  we 
can  trigger  a  correction  if 


lk+ill  1 

L  J  (4.5) 
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COS  ©Jfe+i  <  COS  ©max  •  (4-6) 

Numerical  tests  were  performed  using  this  alternative  criteria  (by  itself  and  in  several  Boolean 
combinations  with  the  original  test)  for  various  values  of  ©max.  Although  this  new  trigger  was  also 
successful  in  improving  performance  and  reliability,  our  results  were  somewhat  erratic  as  ©max  was 
changed  in  comparison  to  the  uniform  performance  increases  exhibited  by  the  original  trigger  as 
m !  and  wi2  were  changed  (within  the  interval  (0,  1]).  At  present,  we  therefore  recommend  the  ori¬ 
ginal  test  over  this  alternative. 

5.  Concluding  remarks. 

5.1.  Summary.  An  important  element  in  the  global  convergence  theory  of  trust  region  algo¬ 
rithms  is  the  uniform  predicted  decrease  condition,  which  is  usually  established  by  assuming  a  uni¬ 
form  upper  bound  on  the  sequence  of  Hessian  approximations.  However,  uniform  predicted 
decrease  is  also  implied  by  the  weaker  condition  of  a  uniform  upper  bound  on  c  ( Bk  ,</*).  This  con¬ 
dition  can  be  enforced  in  secant  methods  by  several  procedures.  The  best  one  of  these  procedures 
involves  performing  an  extra  secant  update  to  correct  B *  in  the  gradient  direction  at  every  itera¬ 
tion  at  which  c  ( Bk ,  gk )  is  larger  than  an  estimate  of  the  maximum  problem  curvature.  Numerical 
testing  suggests  that  this  technique  increases  the  reliability  of  secant  methods  by  a  factor  of  two  or 
more,  while  actually  increasing  several  measures  of  algorithm  efficiency  by  10  to  20  percent. 

It  should  be  pointed  out  that  this  increase  in  efficiency  is  a  surprising  result,  particularly 
since  the  increase  is  remarkably  insensitive  to  how  frequently  corrections  are  performed.  Our  ini¬ 
tial  goal  was  merely  to  find  a  procedure  which  increased  the  reliability  of  secant  methods  without 
immoderately  decreasing  the  efficiency.  We  also  expected  our  procedure  to  work  best  when  correc¬ 
tions  were  done  rarely;  our  experiments  showed  instead  the  utility  of  frequent  extra  updates. 

Another  point  that  should  be  restated  is  that  the  uniform  predicted  decrease  condition  is  not, 
of  itself,  sufficient  to  imply  global  convergence.  This  condition  must  be  coupled  with  conditions 
about  how  the  trust  radius  is  updated  at  each  iteration.  Specifically,  global  convergence  can  be 
shown  if  (2.3)  holds  and  there  exists  e0  >  0  such  that 
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A*  <  fo  1 1  <7*  11/ =>  ^i+i  >  A*  •  (5-1) 

Traditional  global  convergence  proofs  establish  (5.1)  by  again  using  the  assumption  of  a  uniform 
upper  bound  on  the  sequence  of  Hessian  approximations,  so  that  the  weaker  condition  (of  bounded 
Rayleigh  quotients  in  the  gradient  directions)  used  in  this  paper  might  seem  irrelevant  with  respect 
to  the  overall  theory.  However,  (5.1)  can  also  be  established  (see  [2])  by  a  weaker  condition:  that 
(2.3)  holds  and  that  there  exists  /J3  6  [0,  oo)  such  that 

—  0s  <  c  (Bk ,  p )  for  all  nonzero  p  £  RB  .  (5-2) 

That  is,  global  convergence  can  be  established  if  c  (Bk ,  gk )  is  bounded  above  and  if  the  eigenvalues 
of  Bk  are  bounded  away  from  —  oo.  For  positive  definite  secant  methods,  (5.2)  is  automatic  and 
the  procedures  of  this  paper  guarantee  global  convergence. 

5.2.  Extensions.  The  details  of  the  procedures  we  have  presented  have  been  purposefully  kept 
simple  for  clarity;  many  refinements  are  possible.  For  example,  the  calculation  of  the  finite 
difference  steplength  variable  e  was  dependent  on  a  fixed  parameter  typx ,  which  represented  an 
estimate  of  the  typical  size  of  1 1  x*  1 1 .  A  better  approach  would  be  to  prespecify  some  typmin  >  0 

and  use  typxk  =max{  1 1 arfc  (|,  y(  ||ff*  ||  +  1 1 ar*_i  II),  typmin}.  A  variety  of  other  strategies  for  com¬ 
puting  t  are  discussed  [8]  and  [15]. 

A  more  important  topic  not  yet  mentioned  is  the  subject  of  scaling  matrices.  Most  imple¬ 
mentations  of  trust  region  methods  replace  the  spherical  trust  region  heretofore  assumed  with  the 
hyperellipse  \\Dkp  ||  <  A*  where  Dk  is  a  nonsingular  scaling  matrix.  The  trust  region  subprob¬ 
lem  is  then 

minimize  il)k(xk  +  p):  || Dkp  ||  <  A*  .  (5.3) 

p€Rn 

This  can  be  converged  back  to  standard  form  by  making  the  change  of  variables 

x  =  Dkx  (5.4) 

so  that  x  it  —Dkxk,  p  =  Dkp  ,  etc.  Defining 
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$k{£k  +  p)  =  +p)  =  fk  +  9kP  +^PTBkp  (5.5) 

with 

9  k  =  Dk~T9k  (5-6) 

and 

Bk  =  Dk~TBkDk -1  (5.7) 

allows  us  to  replace  (5.3)  with  the  problem 

minimize  k(x  k  +  p):  ||p~  ||  <  A*  .  (5  gl 

P6R*  v  ’ 

A  step  pk  can  then  be  obtained  by  computing  an  approximate  solution  p  k  to  problem  (5.8)  and 
inverting  transformation  (5.4).  If  { Dk }  and  {D*-1}  are  uniformly  bounded,  the  global  convergence 
results  of  this  paper  can  still  be  obtained  provided  there  exists  /?0  <  00  such  that 

c{Bk,g~k)  <p0-  (5.9) 

Our  safeguarding  techniques  should  therefore  be  used  to  enforce  either 

c(Bk,gk)<ck  (5.10a) 

with  c  k  =  max{  c  ,  p  /(  g  k+l-g  k)/p  kp  k  },  or 

c(Bk,(DkTDk)-'gk)<ck  ,  (5.10b) 

depending  on  which  is  more  convenient  for  a  given  implementation. 

Some  other  possible  refinements  are  based  on  secant  updates  which  preserve  past  information. 
Davidon  [3]  and  Schnabel  [23]  have  developed  classes  of  updates  which  satisfy 

Bk+iPk  =  gk+i-gk  and  (Bk+1  -  Bk)Uk  =  0  ,  (5.11) 

where  Uk  £  lRnXm  is  a  matrix  whose  columns  form  an  orthonormal  basis  for  the  space  spanned  by 
one  or  more  previous  search  directions.  Sorensen  [25]  suggests  a  broader  class  of  updates  which 
satisfy  the  weaker  conditions 

Bk+ 1  pk  =  gk+ 1  -  gk  and  Ujf(Bk+1  -  Bk)Uk  =  0  .  (5.12) 

One  application  of  an  update  which  can  satisfy  (5.11)  is  obvious:  under  many  conditions  our  extra 
update  in  procedure  (2)  can  be  executed  so  that 
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Bk+iPc=yc  and  (Bk+1  -  Bk+l)pk  =  0  (5.13) 

(where  Bk+l  =  U  (Bk,  pk,yk)),  and  hence  our  new  approximation  to  the  Hessian  satisfies  both 
Bk+\Pk  —  Vk  and  Bk+Xpt  —  yc.  Similarly,  procedure  (3)  can  be  modified  using  updates  satisfying 
either  (5.11)  or  (5.12)  so  that  our  Hessian  approximation  satisfies  both  Bk+lpk=yk  and 
gk+ 1  Bk+lgk+l/ gk+l  gk+1  =  Fk  (where  ck  is  computed  as  in  procedure  (1)). 

If  updates  satisfying  (5.11)  or  (5.12)  are  used,  the  alternative  trigger  (4.6)  previously  sug¬ 
gested  might  also  become  competitive  with  triggering  extra  updates  whenever  c(Bk+1,  </*+1)  >  ck. 
Rather  than  defining  ©t+1  to  be  the  angle  between  gk+x  and  pk ,  we  can  define  it  to  be  the  angle 
between  gk+x  and  the  “nearest”  element  of  span  {pk,  pk-\, ...,  pk~m}-  Specifically,  cos©i+1  = 
9k(Qk9k)/  II?*  II  lift?*  II  where  Qk  is  the  projection  matrix  Uk  U£. 

This  list  of  possible  refinements  to  our  techniques  is  not  meant  to  be  exhaustive,  but  should 
be  indicative  of  topics  for  future  research.  Our  numerical  tests  were  confined  to  the  BFGS  method, 
but  our  procedures  can  be  applied  to  other  secant  methods,  such  as  the  DFP  update  and  the  “rank 
one”  update.  Safeguarding  seems  particularly  appropriate  for  these  two  updates,  since  they  are 
usually  very  effective,  but  under  certain  circumstances  may  generate  Hessian  approximations  with 
inappropriately  large  eigenvalues.  A  numerical  comparison  of  the  BFGS,  DFP,  and  rank  one 
updates  with  and  without  safeguarding  by  procedure  (2)  is  currently  being  performed.  Our  pro¬ 
cedures  can  also  be  easily  extended  to  such  classes  of  updates  as  sparse  secant  methods  and  struc¬ 
tured  secant  methods.  A  description  of  procedure  (2)  as  applied  to  a  structured  secant  method 
associated  with  nonlinear  least  squares  problems  can  be  found  in  [1]. 


10Depending  on  how  the  algorithm  is  defined,  this  set  may  also  include  pe  steps  from  one  or  more  past  iterations. 
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